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Abstract 

We provide preliminary evidence that existing algorithms for 
inferring small-scale gene regulation networks from gene 
expression data can be adapted to large-scale gene expression data 
coming from hybridization microarrays. The essential steps are (1) 
clustering many genes by their expression time-course data into a 
minimal set of clusters of co -expressed genes, (2) theoretically 
modeling the various conditions under which the time-courses are 
measured using a continious-time analog recurrent neural network 
for the cluster mean time-courses, (3) fitting such a regulatory 
model to the cluster mean time courses by simulated annealing 
with weight decay, and (4) analysing several such fits for 
commonalities in the circuit parameter sets including the 
connection matrices. This procedure can be used to assess the 
adequacy of existing and future gene expression time-course data 
sets for determining transcriptional regulatory relationships such as 
co regulation. 


1 Introduction 

In a cell, genes can be turned “on” or “ofF’ to varying degrees by the protein 
products of other genes. When a gene is “on” it is transcribed to produce messenger 
RNA (mRNA) which can subsequently be translated into protein molecules. Some 
of these proteins are transcription factors which bind to DNA at specific sites and 
thereby affect which genes are transcribed and how often. This trancriptional 
regulation feedback circuitry provides a fundamental mechanism for information 


processing in the cell. It governs differentiation into diverse cell types and many 
other basic biological processes. 

Recently, several new technologies have been developed for measuring the 
"expression” of genes as mRNA or protein product. Improvements in conventional 
fluorescently labeled antibodies against proteins have been coupled with confocal 
microscopy and image processing to partially automate the simultaneous 
measurement of small numbers of proteins in large numbers of individial nuclei in 
the fruit fly Drosophila melanogaster [1], In a complementary way, the mRNA 
levels of thousands of genes, each averaged over many cells, have been measured by 
hybridization arrays for various species including the budding yeast S accharomy ces 
ce rev i siae [2]. 

The high-spatial -resolution protein antibody data has been quantitatively modeled by 
"gene regulation network” circuit models [3] which use continuous -time, analog, 
recurrent neural networks (Hopfield networks without an objective function) to 
model transcriptional regulation [4][5]. This approach requires some machine 
learning technique to infer the circuit parameters from the data, and a particular 
variant of simulated annealing has proven effective [6][7]. Methods in current 
biological use for analysing mRNA hybridization data do not infer regulatory 
relationships, but rather simply cluster together genes with similar patterns of 
expression across ti me an d ex peri ment al conditions [8][9]. In this paper, we explore 
the extension of the gene circuit method to the mRNA hybridization data which has 
much lower spatial resolution but can currently assay a thousand times more genes 
than immuno fluorescent image analysis. 

The essential problem with using the gene circuit method, as employed for 
immuno flourescence data, on hybridization data is that the number of connection 
strength parameters grows between linearly and quadatically in the number of genes 
(depending on sparsity assumptions) . This requires more data on each gene, and 
even if that much data is available, simulated annealing for circuit inference does not 
seem to scale well with the number of unknown parameters. Some form of 
dimensionality reduction is called for. Fortunately dimensionality reckiction is 
available in the present practice of clustering the large-scale time course expression 
data by genes, into gene clusters. In this way one can derive a small number of 
cluster-mean time courses for “aggregated genes”, and then fit a gene regulation 
circuit to these cluster mean time courses. We will discuss details of how this 
analysis can be performed and then interpreted A similar approach using somewhat 
different algorithms for clustering andcircuit inference has been taken by Hertz [10]. 

In the following, we will first summarize the data models and algorithms used, and 
then report on preliminary experiments in applying those algorithms to gene 
expression data for 2467 yeast genes [9 ] [ 1 1 ] . Finally we will discuss prospects for 
and limitations of the approach. 


2 Data Models and Algorithms 


The data model is as follows. We imagine that there is a small, hidden regulatory 
network of "aggregate genes” which regulate one another by the analog neural 
network dynamics [3] 
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in which v. is the continuous -valued state variable for gene product i, T tJ is the 

matrix of positive, zero, or negative connections by which one transcription factor 
can enhance or repress another, and g() is a nonlinear monotonic sigmoidal 
activation function. When a particular matrix entry T (J is nonzero, there is a 

regulatory “connection” from gene product j to gene i . The regulation is enhancing 
if T is positive and repressing if it is negative. If 7^ is zero there is no 

connection. 

This network is run forwards from some initial condition and time-sampled to 
generate a wild-type time course for the aggregate genes. In addition, various other 
time courses can be generated under alternative experimental conditions by 
manipulating the parameters. For example an entire aggregate gene (corresponding 
to a cluster of real genes) could be removed from the circuit or otherwise modified to 
represent mutants. External input conditions could be modeled as modifications to 
h. Thus we get one or several time courses (trajectories) for the aggregate genes. 

From such aggregate time courses, actual gene data is generated by addition of 
Gaussian-distributed noise to the logarithms of the concentration variables. Each 
time point in each cluster has its own scalar standard deviation parameter (and a 
mean arising from the circuit dynamics). Optionally, each gene’s expression data 
may also be multiplied by a time-i ndependent proportionality constant. 

Given this data generation model and suitable gene expression data, the problem is 
to infer gene cluster memberships and the circuit parameters for the aggregate genes’ 
regulatory relationships. Then, we would like to use the inferred cluster 
memberships and regulatory circuitry to make testable biological predictions. 

This data model departs from biological reality in many ways that could prove to be 
important, both for inference and for prediction. First, except for the Gaussian noise 
model, each gene in a cluster is models as fully coregulated with every other one - 
they are influenced in the same ways by the same regulatory connection strengths. 
Second, the nonlinear circuit model must not only reflect transcriptional regulation, 
but all other regulatory circuitry affecting measured gene expression. Such circuitry 
includes protein-protein regulatory interactions such as kinase-phosphatase 
networks, actively controlled protein degradation (proteolysis), translational 
regulation, and intercellular signaling where applicable. 

Under this data model, one could formulate a joint Bayesian inference problem for 
the cl ustering and circuit inference aspects of fitting the data. But given the highly 
provisional nature of the model, we simply apply in sequence an existing mixture- 
of-Gaussians clustering algorithm to preprocess the data and reduce its 
dimensionality, and then an existing gene circuit inference algorithm. Presumably a 
joint optimization algorithm could be obtained by iterating these steps. 

2.1 Clustering 

A widely used clustering algorithm for mixure model estimation is Expectation- 
Maximization (EM)[1 2]. We use EM with a diagonal covariance in the Gaussian, 
so that for each feature vector component a (a combination of experimental condtion 
and time point in a time course) and cluster OC there is a standard deviation 
parameter O a(x . In preprocessing, each concentration data point is divided by its 

value at time zero and then a logarithm taken. The log ratios are clustered using 
EM. Optionally, each gene’s entire feature vector may be normalized to unit length 
and the cluster centers likewise normalized during the iterative EM algorithm. 


In order to choose the number of clusters, k y we use the cross-validation algorithm 
described by Smyth [I 3]. This involves computing the likelihood of each optimized 
fit on a test set andaveraging over runs and over divisions of the data into training 
and test sets. Then, we can examine the likelihood as a function of k in order to 
choose k. Normally one would pick £ so as to maximize cross- validated likelihood. 
However, in the present application we also want to reward small values of k which 
lead to smaller circuits for the circuit inference phase of the algorithm. The choice 
of k will be discussed in the next section. 

2.2 Circuit Inference 

We use the Lam-Delosme variant of simulated annealing (SA) to derive connection 
strengths 7\ time constants T, and decay rates X, as in previous work using this 
gene circuit method [4][5]. We set h to zero. The score function which SA 
optimizes is 

S(T, r,X) = a2(v,-(/;7\t. A)- v,.(r)) 2 + 
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The first term represents the fit to data v r The second term is a standard weight 

decay term. The third term forces solutions to stay within a bounded region in 
weight space. We vary the weight decay coefficient W in order to encourage 
relatively sparse connection matrix solutions. 

3 Results 

3.1 Data 

We used the S acchawmy ces cerevisiae dataset of [9]. It includes three longer time 
courses representing different ways to synchronize the normal cell cycle [1 1], and 
five shorter time courses representing altered conditions. We used all eight time 
courses for clustering, but just 8 time points of one of the longer time courses 
(alpha factor synchronized cell cycle) for the circuit inference. It is likely that 
multiple long time courses under altered condit ions will be required before strong 
biological predictions can be made from in fenred regulatory circuit models. 

3.2 Clustering 

We found that the most likely number of classes as determined by cross validation 
was about 27, but that there is a broad plateau of high-likelihood cluster numbers 
from 15 to 35 (Figure 1). This is similar to our results with another gene 
expression data set for the nematode worm Caenorhctoditis elegans supplied by 
Stuart Kim; these more extensive clustering experiments are summarized in Figure 
2. Clustering experiments with synthetic data is used to understand these results. 
These experiments show that the cross- validated log likelihood curve can indicate the 
number of clusters present in the data, justifying the use of the curve for that 
purpose. In more detail, synthetic data generated from 14 20-dimensional spherical 
Gaussian clusters were clustered using the EM/CV algorithm. The likelihoods 
showed a sharp peak at &=14 unlike Figures 1 or 2. In another experiment , 14 20- 
dimensional spherical Gaussian superclusters were used to generate second-level 
clusters (3 subclusters per supercluster), which in turn generated synthetic data 


points. This two-level hierarchical model was then clustered with the EM/CV 
method. The likelihood curves (not shown) were quite similar to Figures 1 and 2, 
with a higher-likelihood plateau from roughly 14 to 40. 



Figure 1. Cross- validated log-likelihood scores, displayed and averaged over 5 runs, 
for EM clustering of S. cerevisiae gene expression data [9]. Horizontal axis: k , the 
“requested” or maximal number of cluster centers in the fit. Some cluster centers go 
unmatched to data. Vertical axis: log likelihood score for the fit, scatterplotted and 

averaged Likelihoods have not been 




integrated over any range of parameters for 
hypothesis testing, k ranges from 2 to 40 in 

increments of 1 . Solid line shows average likeli hood value for each k. 


Figure 2. Cross- validated log-likelihood scores, averaged over 13 runs, for EM 
cl ustering of C. elegcns gene expression datafromS. Kim’s lab. Horizontal axis: fc, the 
“requested” or maximal number of cluster centers in the fit. Some cluster centers go 
un matched to data. Vertical axis: log likelihood score for the fit, as an average over 13 
runs plus or minus one standard deviation. (Left) Fine-scale plot, k =2 to 60 in 
increments of 2. (Right) Co arse-scale plot, k = 2 to 202 in increments of 10. Both plots 
showan ex tended plateau of relatively likely fits between roughly fc =1 4 andfc =40. 

From Figures 1 and 2 and the synthetic data experiments mentioned above, we can 
guess at appropriate values for k which take into account both the measured 
likelihood of clustering and the requirements for few parameters in circuit-fitting. 



For example choosing £=15 clusters would put us at the beginning of the plateau, 
losing very little cluster likelihood in return for reducing the aggregate genes circuit 
size from 27 to 15 players. The interpretation would be that there are about 15 
superclusters in hierarchically clustered data, to which we will fit a 15-player 
regulatory circuit. Much more aggrfesive would be to pick £=7 or 8 clusters, for a 
relatively significant ckop in log-likelihood in return for a further substantial 
decrease in circuit size. An acceptable range of cluster numbers (and circuit sizes) 
would seem to be £= 8 to 15. 

3.3 Gene Circuit Inference 

It proved possible to fit the £=15 time course using weight decay W - 1 but without 
using hidden units. W = 0 and W - 3 gave less satisfactory results. Four of the 15 
clusters are shown in Figure 3 for one good run (W=l). Scores for our first few 
(unselected) runs at the current parameter settings are shown in Table 1. Each run 
took between 24 and 48 hours on one processor of an Sun Ultras pare 60 computer. 
Even with weight decay, it is possible that successful fits are really overfits with 
this particular data since there are about twice as many parameters as data points. 


Weight Decay 
W 

Score 

Simulated Annealing 
Moves 

Notes 

0 

1.391 

3140000 


0 

1.656 

2310000 


1 

0.528 

3010000 

Figure 3 

1 

1.050 

2710000 

Qualitative fit for most 
cl ustere 

3 

1.417 

2790000 

Poor fit 


Table 1, Score function parameters were A=1 .0, B=0.01 . Anneal ing runs statistics are 
reported when the temperature droppedbelow 0. 0001 . The two lowest -scoring (best) runs 
occurred for W= 1 . More runs will determine whether weight decay W~\ is a necessary 
condition fora goodfi t, or whether one just needs to take the best of N runs and/or slow 
do wn the si mulated anneal ing temperature control. 

There were a few significant similarities between the connection matrices computed 
in the two lowest -scoring runs. Perhaps the most salient feature in the lowest- 
scoring network was a set of direct feedback loops among its strongest connections: 
cluster 8 both excited and was inhibited by cluster 10, and cluster 10 excited and was 
inhibited by cluster 15. This feature was preserved in the second-best run. 
However, a systematic search for “concensus circuitry” awaits more simulated 
annealing runs. From parameter- counting one might expect that making robust and 
unique regulatory predictions will require the use of more trajectory data taken under 
substantially different conditions. Such data is expected to be forthcoming as 
hybridization expression technology is widely adopted 

4 Discussion 

We have illustrated a procedure for deriving regulatory models from large-scale gene 
expression data. As the data becomes more comprehensive in the number and nature 
of conditions under which comparable time courses are measured, this procedure can 
be used to determine when biological hypotheses about gene regulation can be 
robustly derived from the data 
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Figure 3. Four clusters (numbers 9-12) of a 15-cluster mixture of Gaussians model of 
2467 genes each assayed over an eight- point ti me course; cluster means (shown as x) are 
fitto a gene regulation networkmodel (shown as o). 
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